Deep Neural Networks: Deep Learning
Table of Contents
Machine Learning vs. Deep Learning
Deep Artificial Neural Networks
$=$ Learning or estimating the weights and biases of a multi-layer perceptron from training data
3 key components
In mathematical expression
$$\begin{align*}
\min_{\omega} \quad &f(\omega)
\end{align*}
$$
$$ \min_{\omega} \sum_{i=1}^{m}\ell\left( h_{\omega}\left(x^{(i)}\right),y^{(i)}\right)$$
Learning weights and biases from data using gradient descent
<font color='red' size=4>Backpropagation</font>
Backpropagation
Chain Rule
Computing the derivative of the composition of functions
$\left( f(g(x)) \right)' = f'(g(x))\, g'(x)$
${dz \over dx} = {dz \over dy} \cdot {dy \over dx}$
${dz \over dw} = \left({dz \over dy} \cdot {dy \over dx}\right) \cdot {dx \over dw}$
${dz \over du} = \left({dz \over dy} \cdot {dy \over dx} \cdot {dx \over dw}\right) \cdot {dw \over du}$
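The chain rule above can be checked numerically. Below is a minimal NumPy sketch; the particular composition $f(y) = \sin y$, $g(x) = x^2$ and the test point are arbitrary assumptions for illustration:

```python
import numpy as np

# Composition z = f(g(x)) with f(y) = sin(y), g(x) = x**2 (arbitrary choices).
# Chain rule: dz/dx = f'(g(x)) * g'(x) = cos(x**2) * 2x.
def analytic_grad(x):
    return np.cos(x**2) * 2 * x

def numeric_grad(x, h=1e-6):
    # Central finite difference applied to the whole composition.
    f = lambda t: np.sin(t**2)
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.3
print(analytic_grad(x0), numeric_grad(x0))  # the two values agree closely
```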
Backpropagation
Update weights recursively with memory
Optimization procedure
The Vanishing Gradient Problem
As more layers using certain activation functions are added to a neural network, the gradients of the loss function approach zero, making the network hard to train.
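The effect can be seen with a back-of-the-envelope sketch: by the chain rule, each sigmoid layer multiplies the backpropagated gradient by $\sigma'(z) \le 0.25$, so the gradient reaching the early layers shrinks geometrically with depth. A toy NumPy illustration (not part of the lecture code):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Each layer multiplies the backpropagated gradient by sigmoid'(z) <= 0.25.
# Even in the best case (z = 0), the per-layer factor is exactly 0.25,
# so the gradient reaching the first layer decays geometrically with depth.
d = sigmoid(0.0) * (1 - sigmoid(0.0))   # sigmoid'(0) = 0.25
for depth in [2, 5, 10, 20]:
    print(depth, d ** depth)
```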
In this lecture, we will cover the gradient descent algorithm and its variants:
We will explore these three gradient descent algorithms with a logistic regression model.
(= Gradient Descent)
In batch gradient descent, we use the entire training set for each iteration. So, for each update, we have to sum over all examples.
$$\mathcal{E} (\omega) = \frac{1}{m} \sum_{i=1}^{m} \ell (\hat y_i, y_i) = \frac{1}{m} \sum_{i=1}^{m} \ell (h_{\omega}(x_i), y_i)$$
By linearity,
$$\nabla_{\omega} \mathcal{E} = \nabla_{\omega} \frac{1}{m} \sum_{i=1}^{m} \ell (h_{\omega}(x_i), y_i)= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial }{\partial \omega}\ell (h_{\omega}(x_i), y_i)$$
$$\omega \leftarrow \omega - \alpha \, \nabla_{\omega} \mathcal{E} $$
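The batch update rule above can be sketched in plain NumPy for a logistic regression model. This is an illustrative sketch only; the toy data, learning rate, and iteration count are assumptions, not the lecture's code:

```python
import numpy as np

# Batch gradient descent for logistic regression on toy separable data.
# The data, learning rate alpha, and iteration count are assumptions.
np.random.seed(0)
m = 100
X = np.random.randn(m, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # labels from a linear rule

w = np.zeros(2)
alpha = 0.5
for _ in range(200):
    p = 1 / (1 + np.exp(-X @ w))    # h_w(x) for every training example
    grad = X.T @ (p - y) / m        # average gradient over the full batch
    w -= alpha * grad               # w <- w - alpha * grad_w E

acc = np.mean(((1 / (1 + np.exp(-X @ w))) > 0.5) == y)
print(acc)
```

Note that every update touches all $m$ examples, which is exactly the inefficiency discussed below.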
The main advantages:
Although it is a safe and accurate method, it is very inefficient in terms of computation.
The main disadvantages:
Stochastic gradient descent is the extreme case of mini-batch gradient descent with a batch size of one: the parameters are updated after every single example. In practice it is less common than the mini-batch gradient descent method.
Update the parameters based on the gradient for a single training example:
$$f(\omega) = \ell (\hat y_i, y_i) = \ell (h_{\omega}(x_i), y_i) = \ell^{(i)}$$
$$\omega \leftarrow \omega - \alpha \, \frac{\partial \ell^{(i)}}{\partial \omega}$$
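A minimal NumPy sketch of the single-example update (the toy data, step size, and step count are assumptions for illustration):

```python
import numpy as np

# Stochastic gradient descent: one randomly sampled example per update.
# The toy data, learning rate, and number of steps are assumptions.
np.random.seed(0)
m = 100
X = np.random.randn(m, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
alpha = 0.1
for _ in range(2000):
    i = np.random.randint(m)            # sample one training example
    p = 1 / (1 + np.exp(-X[i] @ w))
    w -= alpha * (p - y[i]) * X[i]      # w <- w - alpha * d ell^(i) / dw

acc = np.mean(((1 / (1 + np.exp(-X @ w))) > 0.5) == y)
print(acc)
```

Each step is very cheap, but the update direction fluctuates from step to step, which is the noisiness discussed below.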
The advantages:
The disadvantages:
Mathematical justification: if you sample a training example at random, the stochastic gradient is an unbiased estimate of the batch gradient:
$$\mathbb{E} \left[\frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell^{(i)}}{\partial \omega} = \frac{\partial }{\partial \omega} \left[ \frac{1}{m} \sum_{i=1}^{m} \ell^{(i)} \right] = \frac{\partial \mathcal{E}}{\partial \omega}$$
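This unbiasedness can also be checked empirically. Below is a toy one-dimensional squared-loss sketch (all numbers are assumptions) comparing a Monte-Carlo estimate of the expectation with the batch gradient:

```python
import numpy as np

# Check that a randomly sampled per-example gradient is an unbiased
# estimate of the batch gradient. Toy 1-D squared loss; all values
# here are assumptions for illustration.
np.random.seed(1)
m, w = 50, 0.7
x = np.random.randn(m)
y = 2 * x + 0.1 * np.random.randn(m)

per_example = (w * x - y) * x        # d/dw of 0.5 * (w*x_i - y_i)^2
batch_grad = per_example.mean()      # gradient of the averaged loss E

# Monte-Carlo estimate of E[d ell^(i)/dw] with i sampled uniformly:
mc_estimate = per_example[np.random.randint(m, size=100000)].mean()
print(batch_grad, mc_estimate)       # the two values are close
```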
Below is a graph that shows the gradient descent's variants and their direction towards the minimum:
As we can see in the figure, the SGD direction is very noisy compared to the others.
Mini-batch gradient descent is the method most often used in machine learning and deep learning training. The main idea is similar to batch gradient descent. However, instead of summing over all $m$ examples, mini-batch gradient descent sums over a smaller number of examples given by the batch size $s \;(< m)$, so the parameters are updated once per mini-batch. This works because the examples in a large data set are assumed to be positively correlated: a large data set contains many similar examples, so a mini-batch gradient is a good estimate of the full gradient.
$$\mathcal{E} (\omega) = \frac{1}{s} \sum_{i=1}^{s} \ell (\hat y_i, y_i) = \frac{1}{s} \sum_{i=1}^{s} \ell (h_{\omega}(x_i), y_i) = \frac{1}{s} \sum_{i=1}^{s} \ell^{(i)}$$
$$\omega \leftarrow \omega - \alpha \, \nabla_{\omega} \mathcal{E} $$
Stochastic gradients computed on larger mini-batches have smaller variance:
$$\text{var} \left[ \frac{1}{s} \sum_{i=1}^{s} \frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{s^2} \text{var} \left[ \sum_{i=1}^{s} \frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{s} \text{var} \left[ \frac{\partial \ell^{(i)}}{\partial \omega} \right]$$
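The $1/s$ variance reduction can be verified empirically. A quick NumPy sketch with stand-in per-example gradients (standard-normal draws, so the single-example variance is 1; the sizes are assumptions):

```python
import numpy as np

# Empirical check that averaging s per-example gradients divides the
# variance by s. Standard-normal draws stand in for per-example
# gradients (single-example variance is 1); sizes are assumptions.
np.random.seed(0)
grads = np.random.randn(100000)

variances = {}
for s in [1, 4, 16]:
    batches = grads[: (len(grads) // s) * s].reshape(-1, s)
    variances[s] = batches.mean(axis=1).var()   # variance of batch means
    print(s, variances[s])                      # roughly 1, 1/4, 1/16
```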
The mini-batch size $s$ is a hyper-parameter that needs to be set.
The main advantages of SGD:
The main disadvantages of SGD:
Note that if the batch size equals the number of training examples, mini-batch gradient descent is identical to batch gradient descent.
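Putting it together, a mini-batch loop for the same kind of toy logistic regression might look like this (the batch size, learning rate, data, and iteration count are illustrative assumptions):

```python
import numpy as np

# Mini-batch gradient descent with batch size s < m on toy
# logistic-regression data. All hyper-parameters are assumptions.
np.random.seed(0)
m, s = 100, 10
X = np.random.randn(m, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
alpha = 0.5
for _ in range(500):
    idx = np.random.choice(m, s, replace=False)   # draw one mini-batch
    p = 1 / (1 + np.exp(-X[idx] @ w))
    w -= alpha * X[idx].T @ (p - y[idx]) / s      # average over s examples

acc = np.mean(((1 / (1 + np.exp(-X @ w))) > 0.5) == y)
print(acc)
```

Setting `s = m` here would recover batch gradient descent, and `s = 1` would recover SGD.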
Summary
tf.nn.dropout(layer, rate = p)
Batch normalization is a technique for improving the performance and stability of artificial neural networks.
It normalizes a layer's inputs by re-centering and re-scaling the activations.
During training, batch normalization shifts and rescales according to the mean and variance estimated on the current batch.
At test time, it simply shifts and rescales according to the empirical moments estimated during training.
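This train/test distinction can be sketched in NumPy with running moments. The class below is a simplified stand-in for `tf.layers.batch_normalization`, not its implementation; the momentum and epsilon values are assumptions:

```python
import numpy as np

# Simplified 1-D batch normalization with running moments
# (a NumPy stand-in; momentum and eps values are assumptions).
class BatchNorm1D:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)   # scale, shift
        self.run_mean, self.run_var = np.zeros(dim), np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training):
        if training:
            # Normalize with batch moments and update the running averages.
            mu, var = x.mean(axis=0), x.var(axis=0)
            self.run_mean = self.momentum * self.run_mean + (1 - self.momentum) * mu
            self.run_var = self.momentum * self.run_var + (1 - self.momentum) * var
        else:
            # At test time, reuse the moments accumulated during training.
            mu, var = self.run_mean, self.run_var
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta

np.random.seed(0)
bn = BatchNorm1D(3)
out = bn(5 + 2 * np.random.randn(64, 3), training=True)
print(out.mean(axis=0))   # per-feature means are ~0 after normalization
```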
Tensorflow
Keras
PyTorch
TensorFlow is an open-source software library for deep learning. It's a framework to perform computation very efficiently, and it can tap into the GPU (Graphics Processing Unit) in order to speed it up even further. This makes a huge difference, as we shall see shortly. TensorFlow can be controlled by a simple Python API.
TensorFlow is one of the most widely used libraries for implementing machine learning and other algorithms involving a large number of mathematical operations. TensorFlow was developed by Google and is one of the most popular machine learning libraries on GitHub. Google uses TensorFlow in almost all of its machine learning applications.
Tensor
TensorFlow gets its name from tensors, which are arrays of arbitrary dimensionality. A vector is a 1-d array and is known as a 1st-order tensor. A matrix is a 2-d array and a 2nd-order tensor. The "flow" part of the name refers to computation flowing through a graph. Training and inference in a neural network, for example, involves the propagation of matrix computations through many nodes in a computational graph.
To run any of the three defined operations, we need to create a session for that graph. The session will also allocate memory to store the current value of the variable.
When you think of doing things in TensorFlow, you might want to think of creating tensors (like matrices), adding operations (that output other tensors), and then executing the computation (running the computational graph). In particular, it's important to realize that when you add an operation on tensors, it doesn't execute immediately. Rather, TensorFlow waits for you to define all the operations you want to perform. Then, TensorFlow optimizes the computation graph, deciding how to execute the computation, before generating the data. Because of this, a tensor in TensorFlow isn't so much holding the data as a placeholder for holding the data, waiting for the data to arrive when a computation is executed.
tf.constant
tf.Variable
tf.placeholder
tf.constant creates a constant tensor specified by value, dtype, shape, and so on.
import tensorflow as tf
a = tf.constant([1,2,3])
b = tf.constant(4, shape=[1,3])
A = a + b
B = a*b
The result of these lines of code is an abstract tensor in the computation graph. However, contrary to what you might expect, the result isn't actually calculated yet: the model has only been defined, and no process has run to compute it.
A
B
sess = tf.Session()
sess.run(A)
sess.run(B)
You can also use the following lines of code to start up an interactive Session, run the result and close the Session automatically again after printing the output:
a = tf.constant([1,2,3])
b = tf.constant([4,5,6])
result = tf.multiply(a, b)
with tf.Session() as sess:
output = sess.run(result)
print(output)
tf.Variable is regarded as the decision variable in optimization. Variables must be initialized before they can be used.
x1 = tf.Variable([1, 1], dtype = tf.float32)
x2 = tf.Variable([2, 2], dtype = tf.float32)
y = x1 + x2
print(y)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
sess.run(y)
The value of tf.placeholder must be fed using the feed_dict optional argument to Session.run().
sess = tf.Session()
x = tf.placeholder(tf.float32, shape = [2,2])
sess.run(x, feed_dict = {x : [[1,2],[3,4]]})
a = tf.placeholder(tf.float32, shape = [2])
b = tf.placeholder(tf.float32, shape = [2])
sum = a + b
sess.run(sum, feed_dict = {a : [1,2], b : [3,4]})
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
Overfitting in Regression
N = 10
data_x = np.linspace(-4.5, 4.5, N)
data_y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])
data_x = data_x.reshape(-1,1)
data_y = data_y.reshape(-1,1)
plt.figure(figsize = (10,8))
plt.plot(data_x, data_y, 'o')
plt.grid(alpha = 0.3)
plt.show()
n_input = 1
n_hidden1 = 30
n_hidden2 = 100
n_hidden3 = 100
n_hidden4 = 30
n_output = 1
weights = {
'hidden1' : tf.Variable(tf.random_normal([n_input, n_hidden1], stddev = 0.1)),
'hidden2' : tf.Variable(tf.random_normal([n_hidden1, n_hidden2], stddev = 0.1)),
'hidden3' : tf.Variable(tf.random_normal([n_hidden2, n_hidden3], stddev = 0.1)),
'hidden4' : tf.Variable(tf.random_normal([n_hidden3, n_hidden4], stddev = 0.1)),
'output' : tf.Variable(tf.random_normal([n_hidden4, n_output], stddev = 0.1)),
}
biases = {
'hidden1' : tf.Variable(tf.random_normal([n_hidden1], stddev = 0.1)),
'hidden2' : tf.Variable(tf.random_normal([n_hidden2], stddev = 0.1)),
'hidden3' : tf.Variable(tf.random_normal([n_hidden3], stddev = 0.1)),
'hidden4' : tf.Variable(tf.random_normal([n_hidden4], stddev = 0.1)),
'output' : tf.Variable(tf.random_normal([n_output], stddev = 0.1)),
}
x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_output])
def build_model(x, weights, biases):
hidden1 = tf.add(tf.matmul(x, weights['hidden1']), biases['hidden1'])
hidden1 = tf.nn.sigmoid(hidden1)
hidden2 = tf.add(tf.matmul(hidden1, weights['hidden2']), biases['hidden2'])
hidden2 = tf.nn.sigmoid(hidden2)
hidden3 = tf.add(tf.matmul(hidden2, weights['hidden3']), biases['hidden3'])
hidden3 = tf.nn.sigmoid(hidden3)
hidden4 = tf.add(tf.matmul(hidden3, weights['hidden4']), biases['hidden4'])
hidden4 = tf.nn.sigmoid(hidden4)
output = tf.add(tf.matmul(hidden4, weights['output']), biases['output'])
return output
pred = build_model(x, weights, biases)
loss = tf.square(pred - y)
loss = tf.reduce_mean(loss)
LR = 0.001
optm = tf.train.AdamOptimizer(LR).minimize(loss)
n_batch = 50
n_iter = 10000
n_prt = 1000
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
loss_record = []
for epoch in range(n_iter):
idx = np.random.randint(N, size = n_batch)
train_x = data_x[idx,:]
train_y = data_y[idx,:]
sess.run(optm, feed_dict = {x: train_x, y: train_y})
if epoch % n_prt == 0:
c = sess.run(loss, feed_dict = {x: train_x, y: train_y})
loss_record.append(c)
plt.figure(figsize = (10,8))
plt.plot(np.arange(len(loss_record))*n_prt, loss_record, label = 'training')
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.grid(alpha = 0.3)
plt.legend(fontsize = 12)
plt.ylim([0, 10])
plt.show()
xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
my_pred = sess.run(pred, feed_dict = {x: xp})
plt.figure(figsize = (10,8))
plt.plot(data_x, data_y, 'o')
plt.plot(xp, my_pred, 'r')
plt.grid(alpha = 0.3)
plt.show()
Dropout Implementation
p = tf.placeholder(tf.float32)
x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_output])
def build_model(x, weights, biases, p):
hidden1 = tf.add(tf.matmul(x, weights['hidden1']), biases['hidden1'])
hidden1 = tf.nn.sigmoid(hidden1)
dropout1 = tf.nn.dropout(hidden1, rate = p)
hidden2 = tf.add(tf.matmul(dropout1, weights['hidden2']), biases['hidden2'])
hidden2 = tf.nn.sigmoid(hidden2)
dropout2 = tf.nn.dropout(hidden2, rate = p)
hidden3 = tf.add(tf.matmul(dropout2, weights['hidden3']), biases['hidden3'])
hidden3 = tf.nn.sigmoid(hidden3)
dropout3 = tf.nn.dropout(hidden3, rate = p)
hidden4 = tf.add(tf.matmul(dropout3, weights['hidden4']), biases['hidden4'])
hidden4 = tf.nn.sigmoid(hidden4)
dropout4 = tf.nn.dropout(hidden4, rate = p)
output = tf.add(tf.matmul(dropout4, weights['output']), biases['output'])
return output
pred = build_model(x, weights, biases, p)
loss = tf.square(pred - y)
loss = tf.reduce_mean(loss)
LR = 0.001
optm = tf.train.AdamOptimizer(LR).minimize(loss)
n_batch = 50
n_iter = 10000
n_prt = 1000
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
loss_record = []
for epoch in range(n_iter):
idx = np.random.randint(N, size = n_batch)
train_x = data_x[idx,:]
train_y = data_y[idx,:]
sess.run(optm, feed_dict = {x: train_x, y: train_y, p: 0.2})
if epoch % n_prt == 0:
c = sess.run(loss, feed_dict = {x: train_x, y: train_y, p: 0.2})
loss_record.append(c)
#print ("Iter : {}".format(epoch))
#print ("Train Cost : {}".format(c))
plt.figure(figsize = (10,8))
plt.plot(np.arange(len(loss_record))*n_prt, loss_record, label = 'training')
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.grid(alpha = 0.3)
plt.legend(fontsize = 12)
plt.ylim([0, 10])
plt.show()
xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
my_pred = sess.run(pred, feed_dict = {x: xp, p: 0})
plt.figure(figsize = (10,8))
plt.plot(data_x, data_y, 'o')
plt.plot(xp, my_pred, 'r')
plt.grid(alpha = 0.3)
plt.show()
Batch Normalization Implementation
is_training = tf.placeholder(tf.bool)
x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_output])
def build_model(x, weights, biases, is_training):
hidden1 = tf.add(tf.matmul(x, weights['hidden1']), biases['hidden1'])
hidden1 = tf.layers.batch_normalization(hidden1, training = is_training)
hidden1 = tf.nn.sigmoid(hidden1)
hidden2 = tf.add(tf.matmul(hidden1, weights['hidden2']), biases['hidden2'])
hidden2 = tf.layers.batch_normalization(hidden2, training = is_training)
hidden2 = tf.nn.sigmoid(hidden2)
hidden3 = tf.add(tf.matmul(hidden2, weights['hidden3']), biases['hidden3'])
hidden3 = tf.layers.batch_normalization(hidden3, training = is_training)
hidden3 = tf.nn.sigmoid(hidden3)
hidden4 = tf.add(tf.matmul(hidden3, weights['hidden4']), biases['hidden4'])
hidden4 = tf.layers.batch_normalization(hidden4, training = is_training)
hidden4 = tf.nn.sigmoid(hidden4)
output = tf.add(tf.matmul(hidden4, weights['output']), biases['output'])
return output
pred = build_model(x, weights, biases, is_training)
loss = tf.square(pred - y)
loss = tf.reduce_mean(loss)
LR = 0.001
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
optm = tf.train.AdamOptimizer(LR).minimize(loss)
tf.get_default_graph().get_all_collection_keys()
tf.get_collection('trainable_variables')
tf.get_collection('variables')
tf.get_collection('update_ops')
n_batch = 50
n_iter = 10000
n_prt = 1000
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
loss_record = []
for epoch in range(n_iter):
idx = np.random.randint(N, size = n_batch)
train_x = data_x[idx,:]
train_y = data_y[idx,:]
sess.run(optm, feed_dict = {x: train_x, y: train_y, is_training: True})
if epoch % n_prt == 0:
c = sess.run(loss, feed_dict = {x: train_x, y: train_y, is_training: True})
loss_record.append(c)
plt.figure(figsize = (10,8))
plt.plot(np.arange(len(loss_record))*n_prt, loss_record, label = 'training')
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.grid(alpha = 0.3)
plt.legend(fontsize = 12)
plt.ylim([0, 10])
plt.show()
xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
my_pred = sess.run(pred, feed_dict = {x: xp, is_training: False})
plt.figure(figsize = (10,8))
plt.plot(data_x, data_y, 'o')
plt.plot(xp, my_pred, 'r')
plt.grid(alpha = 0.3)
plt.show()